Skip to content

Eliminate HEAD requests during downloads, especially for faster transfers of small files#363

Open
crowecawcaw wants to merge 6 commits into
boto:developfrom
crowecawcaw:callback
Open

Eliminate HEAD requests during downloads, especially for faster transfers of small files#363
crowecawcaw wants to merge 6 commits into
boto:developfrom
crowecawcaw:callback

Conversation

@crowecawcaw
Copy link
Copy Markdown

@crowecawcaw crowecawcaw commented Nov 21, 2025

This PR optimizes downloads through the TransferManager by removing the upfront HEAD request on an opt-in basis. Previously, every download issued a HEAD request to determine object size before starting the GET. This change eliminates that extra round-trip by extracting metadata from the first GET response instead, when the caller has configured response_checksum_validation='when_required' on the client. The default path still uses HEAD so that full-object checksum validation continues to work.

For small files, the download time is dominated by request latency, so eliminating one of the two requests results in a ~50% download time reduction. For large files, the effect is less noticeable because the download time is dominated by the transfer itself and there are multiple chunks. In both cases, we save the cost of the HEAD request.

What Changed

No-HEAD download path added, gated on response_checksum_validation='when_required': Downloads on that path start immediately with a ranged GET request for the first chunk. The default path is unchanged.
Dynamic size detection (no-HEAD path): Extract object size and ETag from the first GET response headers (ContentRange or ContentLength).
Dynamic chunk scheduling (no-HEAD path): After the first chunk completes, schedule additional chunks only if the object is larger than the chunk size.
ETag consistency for all chunks: When an ETag is available (either pre-provided via a subscriber, e.g. from a prior HEAD request by the AWS CLI, or extracted from the first GET response), all subsequent ranged GET requests include an IfMatch header with that ETag. This ensures S3 rejects the request if the object changes mid-download. The first GET request also includes IfMatch if an ETag was pre-provided before the download started.
Zero-byte object handling (no-HEAD path): The first ranged GET on a 0-byte object returns InvalidRange. The new path handles this by returning an empty response instead of retrying with a non-ranged GET.

Testing

Unit, functional, and integ tests pass. I also added a new script to benchmark downloading many small files (scripts/performance/time-batch-download.py). For 1000 × 1 KB files on my laptop, total download duration dropped ~41% (from 15.0s to 8.9s) on the opt-in path.

Backward Compatibility

External API unchanged. All download methods have the same signatures. The default path (HEAD + size-branching GET) is preserved, so no behavior change for callers who don't set response_checksum_validation='when_required'.

Flow Diagrams

Default path (unchanged)

flowchart TD
    A[HEAD Request] --> C{Size < 8MB?}
    C -->|Yes| D[GET Request]
    C -->|No| E[Multiple GET Requests]
    D --> F[Complete]
    E --> F
Loading

No-HEAD path (opt-in via response_checksum_validation="when_required")

flowchart TD
    A[GET First Chunk] --> B{Size < 8MB?}
    B -->|Yes| C[Complete]
    B -->|No| D[GET Remaining Chunks]
    D --> C
Loading

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link
Copy Markdown
Contributor

@aemous aemous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left 1 nit. Note that I am not the primary reviewer for this PR, and we are still pending the primary review.

Comment thread s3transfer/download.py Outdated
Comment thread s3transfer/download.py
Comment thread s3transfer/download.py
@aemous
Copy link
Copy Markdown
Contributor

aemous commented Mar 6, 2026

Looks like downloads may be failing when the object has a size of 0 bytes. The error I'm seeing is

botocore.exceptions.ClientError: An error occurred (InvalidRange) when calling the GetObject operation: The requested range is not satisfiable

Can you confirm whether you are able to reproduce this, and look into resolving it?

InvalidRange occurs when the range we are requesting has no overlap with the object itself. For 0 byte objects, this will always be the case if we are requesting the first 0-x bytes, since there are no bytes at all. The only feasible patch I can imagine here would be a try-except clause that handles InvalidRange by performing a non-ranged GET, which may be undesirable. If you have a better alternative for resolving this that doesn't involve needing an extra request, that would be preferred.

@crowecawcaw
Copy link
Copy Markdown
Author

I see the same issue. I added an integ test, fixed the unit test to mirror the same behavior, then fixed the issue. Instead of sending a second Get request for an object we know is empty, it just returns the empty content.

@aemous
Copy link
Copy Markdown
Contributor

aemous commented Mar 12, 2026

@crowecawcaw Can you push a revision that satisfies the 'Lint code' GH Action?

In case you cannot see the Action details, I pasted it here:

pre-commit hook(s) made changes.
If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
diff --git a/scripts/performance/time-batch-download.py b/scripts/performance/time-batch-download.py
index 93e8a2a..1cf04fc 100755
--- a/scripts/performance/time-batch-download.py
+++ b/scripts/performance/time-batch-download.py
@@ -2,10 +2,12 @@
 """Direct timing of batch downloads without shell wrapper."""
 
 import argparse
-import tempfile
 import shutil
+import tempfile
 import time
+
 from botocore.session import get_session
+
 from s3transfer.manager import TransferManager
 
 
@@ -20,13 +22,13 @@ def main():
     parser.add_argument('--file-size', type=int, required=True)
     parser.add_argument('--s3-bucket', required=True)
     args = parser.parse_args()
-    
+
     session = get_session()
     client = session.create_client('s3')
-    
+
     tempdir = tempfile.mkdtemp()
     s3_keys = []
-    
+
     try:
         # Upload files
         print(f"Uploading {args.file_count} files...")
@@ -37,7 +39,7 @@ def main():
                 s3_key = f"perf_test_{i}"
                 manager.upload(file_path, args.s3_bucket, s3_key)
                 s3_keys.append(s3_key)
-        
+
         # Download files
         print(f"Downloading {args.file_count} files...")
         start_time = time.time()
@@ -46,9 +48,9 @@ def main():
                 download_path = f"{tempdir}/download_{i}"
                 manager.download(args.s3_bucket, s3_key, download_path)
         duration = time.time() - start_time
-        
+
         print(f"Download duration: {duration:.2f} seconds")
-        
+
         # Cleanup
         for s3_key in s3_keys:
             client.delete_object(Bucket=args.s3_bucket, Key=s3_key)

Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
@crowecawcaw crowecawcaw force-pushed the callback branch 5 times, most recently from b1fe8b8 to 5635ee3 Compare April 24, 2026 04:37
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 24, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 98.95833% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 81.14%. Comparing base (9281a7c) to head (62927b7).
⚠️ Report is 14 commits behind head on develop.

Files with missing lines Patch % Lines
s3transfer/download.py 98.95% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #363      +/-   ##
===========================================
+ Coverage    80.41%   81.14%   +0.72%     
===========================================
  Files           16       16              
  Lines         3013     3103      +90     
===========================================
+ Hits          2423     2518      +95     
+ Misses         590      585       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@crowecawcaw crowecawcaw force-pushed the callback branch 2 times, most recently from 9d25e90 to 65d4cde Compare April 30, 2026 20:54
Copy link
Copy Markdown
Contributor

@aemous aemous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loogs good overall. Left a few questions/comments.

Comment thread .changes/next-release/feature-download-85557.json Outdated
Comment thread s3transfer/download.py Outdated
Comment thread scripts/performance/time-batch-download.py
Comment thread tests/unit/test_download.py Outdated
@crowecawcaw crowecawcaw force-pushed the callback branch 2 times, most recently from afcb032 to 65d4cde Compare May 1, 2026 18:42
Address reviewer feedback that the HEAD request is needed by default to
enable full-object checksum validation on downloads. Only skip the HEAD
when the caller has opted out via botocore's
`response_checksum_validation = when_required` config. The prior HEAD +
size-based GET logic is restored as the default path.

Also:
- Add TestDownloadWhenRequiredChecksumValidation covering the HEAD-less
  path for non-empty and 0-byte objects (InvalidRange handling)
- Fix ruff import ordering / formatting in time-batch-download.py
  flagged by the Lint CI action
Copy link
Copy Markdown
Contributor

@aemous aemous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, 1 small change requested.

Comment thread s3transfer/download.py Outdated
def __call__(self):
# Always check if we have a task and response first
assert self._task is not None, (
"set_task() must be called before the task is submitted"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace this assertion with raising a RuntimeError with a similar message wrapped? It would be more idiomatic with our codebase.

Replace the assert with a RuntimeError to match the codebase's
idiomatic handling of programmer errors, per PR review feedback.

Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
@crowecawcaw crowecawcaw requested a review from aemous May 11, 2026 21:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants